Now that we have the Bag-of-Words, we can train machine learning models. We will consider CART, Random Forest, and Boosting models.
First, let’s read in the data that contains the final Bag-of-Words.
library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.5
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.2.0
## Warning: package 'tibble' was built under R version 3.4.3
## Warning: package 'tidyr' was built under R version 3.4.4
## Warning: package 'purrr' was built under R version 3.4.4
## Warning: package 'dplyr' was built under R version 3.4.4
## Warning: package 'stringr' was built under R version 3.4.4
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
reviewsBagOfWords <- readRDS("data/reviewsBagOfWords.RDS")
Next, let’s convert the review score category to an ordered factor. This is helpful when we have 3+ ordered categories, because it determines the row and column order of the confusion matrices in our results.
ratingLevels <- c("Terrible", "Low", "Mid", "High", "Perfect")
reviewsBagOfWords <- reviewsBagOfWords %>%
mutate(review_scores_category =
as.factor(ordered(review_scores_category, levels = ratingLevels)))
## Warning: package 'bindrcpp' was built under R version 3.4.4
Next, we prefix any column names that begin with a digit with an “x”. This is necessary for Random Forest, whose formula interface cannot handle variable names that start with a number.
numericIndices <- which(grepl("^[0-9]", names(reviewsBagOfWords)))
names(reviewsBagOfWords)[numericIndices] <-
paste0("x", names(reviewsBagOfWords)[numericIndices])
Before we begin, we first split our data into training/testing sets.
library(caTools)
## Warning: package 'caTools' was built under R version 3.4.4
set.seed(123)
X <- as.matrix(reviewsBagOfWords[,4:ncol(reviewsBagOfWords)])
y <- as.vector(reviewsBagOfWords$review_scores_category)
spl <- sample.split(y, SplitRatio = 0.75)
dataTrain <- reviewsBagOfWords[spl,] %>%
select(-listing_id, -review_scores_rating) %>%
rename(Y = review_scores_category)
dataTest <- reviewsBagOfWords[!spl,] %>%
select(-listing_id, -review_scores_rating) %>%
rename(Y = review_scores_category)
xTrain <- X[spl,]
xTest <- X[!spl,]
yTrain <- y[spl]
yTest <- y[!spl]
First, fit a deep CART model (low cp value). We set a seed because the rpart() function performs cross-validation, which is a randomized algorithm.
library(rpart)
## Warning: package 'rpart' was built under R version 3.4.3
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.4.4
set.seed(123)
treeBig <- rpart(Y ~ ., data = dataTrain, cp = 0.001)
We can plot this initial tree:
prp(treeBig)
Next, we use the built-in cross-validation in CART to see how our model performs for varying values of cp. We can use the printcp() command to see the cross-validated error for different values of cp, where:

- “nsplit” = number of splits in the tree
- “rel error” = scaled training error
- “xerror” = scaled cross-validation error
- “xstd” = standard deviation of xerror
printcp(treeBig)
##
## Classification tree:
## rpart(formula = Y ~ ., data = dataTrain, cp = 0.001)
##
## Variables actually used in tree construction:
## [1] access actual airbnb also anoth apart beauti
## [8] bed bit boston call choic clean close
## [15] comfort condit corner das day definit describ
## [22] dirti door enjoy especi everyth excel extrem
## [29] food get good grate great help high
## [36] home hospit hot howev just know like
## [43] locat loud love made make man might
## [50] min nearbi need nice one orang person
## [57] place pretti price public question quick quiet
## [64] rent rental respons restaur run servic shop
## [71] south space spacious station stay street tast
## [78] thank top total two uber use want
## [85] wash wasnt well will wonder
##
## Root node error: 1556/2122 = 0.73327
##
## n= 2122
##
## CP nsplit rel error xerror xstd
## 1 0.2005141 0 1.00000 1.00000 0.013093
## 2 0.0347044 1 0.79949 0.79949 0.014581
## 3 0.0122108 2 0.76478 0.76607 0.014689
## 4 0.0099614 4 0.74036 0.74679 0.014735
## 5 0.0089974 7 0.70694 0.73136 0.014763
## 6 0.0073907 8 0.69794 0.73072 0.014764
## 7 0.0051414 10 0.68316 0.72237 0.014776
## 8 0.0048201 12 0.67288 0.72044 0.014779
## 9 0.0044987 14 0.66324 0.72108 0.014778
## 10 0.0041774 15 0.65874 0.72494 0.014773
## 11 0.0038560 17 0.65039 0.72751 0.014769
## 12 0.0036418 26 0.61568 0.73008 0.014765
## 13 0.0032134 30 0.60026 0.74614 0.014737
## 14 0.0028920 31 0.59704 0.74486 0.014739
## 15 0.0025707 43 0.55334 0.74679 0.014735
## 16 0.0024100 58 0.51285 0.75514 0.014717
## 17 0.0022494 63 0.50064 0.75450 0.014718
## 18 0.0021422 69 0.48715 0.75450 0.014718
## 19 0.0019280 75 0.47429 0.75964 0.014706
## 20 0.0016710 95 0.43509 0.76542 0.014691
## 21 0.0016067 101 0.42481 0.76285 0.014698
## 22 0.0012853 105 0.41838 0.76285 0.014698
## 23 0.0010000 114 0.40617 0.77378 0.014667
Cool, so rpart automatically computes all of the cross-validated errors for trees with cp = 0.001 and up! We can also see these results visually using the plotcp command. In this plot:

- size of tree = (number of splits in tree) + 1
- the dotted line sits at 1 std. dev. above the minimum xerror
plotcp(treeBig)
A common rule of thumb (the “1-SE rule”) is to select the leftmost cp value whose xerror first falls below the dotted line, and then prune the tree using this value.
treeFinal <- prune(treeBig, cp = 0.009)
prp(treeFinal)
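Rather than reading the cp value off the plot, we can apply the 1-SE rule programmatically. This is a minimal sketch, assuming treeBig has been fit as above; cpTable, threshold, and cpBest are illustrative names, and the selected value should come out close to the 0.009 used above:

```r
# 1-SE rule: pick the smallest tree whose xerror is within one
# standard deviation of the minimum cross-validated error.
cpTable <- as.data.frame(treeBig$cptable)
threshold <- min(cpTable$xerror) + cpTable$xstd[which.min(cpTable$xerror)]
cpBest <- cpTable$CP[which(cpTable$xerror < threshold)[1]]  # rows go from largest cp down
cpBest
```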
We can obtain predictions using the predict() function:
predTrain <- predict(treeFinal, newdata = dataTrain, type = "class")
predTest <- predict(treeFinal, newdata = dataTest, type = "class")
Let’s evaluate the accuracy of our model:
print("Training Set")
## [1] "Training Set"
t <- table(dataTrain$Y, predTrain)
t # Confusion Matrix
## predTrain
## Terrible Low Mid High Perfect
## Terrible 57 38 0 15 64
## Low 11 147 8 102 141
## Mid 4 95 32 256 115
## High 1 51 22 450 42
## Perfect 19 4 2 110 336
sum(diag(t)) / nrow(dataTrain) # Accuracy
## [1] 0.4816211
print("Testing Set")
## [1] "Testing Set"
t <- table(dataTest$Y, predTest)
t # Confusion Matrix
## predTest
## Terrible Low Mid High Perfect
## Terrible 19 15 0 1 23
## Low 2 48 4 31 51
## Mid 3 25 2 85 53
## High 1 14 4 147 22
## Perfect 8 1 0 47 101
sum(diag(t)) / nrow(dataTest) # Accuracy
## [1] 0.4483734
45% is not bad out-of-sample, when you consider that there are 5 possible categories. Can the other models do better?
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.4.4
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
# Fit the forest with 200 trees, removing a couple of the words which
# contain non-alphanumeric characters.
# Technical Note: The random forest function complains if we try to include
# variable names with non-alphanumeric characters.
modelForest <- randomForest(Y ~ .,
data = dataTrain[,c(1:219,221:749,751:1137)],
ntree = 200)
Make predictions on the training and test sets:
predTrain <- predict(modelForest, newdata = dataTrain[,c(1:219,221:749,751:1137)])
predTest <- predict(modelForest, newdata = dataTest[,c(1:219,221:749,751:1137)])
Check the accuracy:
print("Training Set")
## [1] "Training Set"
t <- table(dataTrain$Y, predTrain)
t # Confusion Matrix
## predTrain
## Terrible Low Mid High Perfect
## Terrible 173 0 0 0 1
## Low 0 402 0 0 7
## Mid 0 0 501 0 1
## High 0 0 0 566 0
## Perfect 1 0 0 0 470
sum(diag(t)) / nrow(dataTrain) # Accuracy
## [1] 0.9952875
print("Testing Set")
## [1] "Testing Set"
t <- table(dataTest$Y, predTest)
t # Confusion Matrix
## predTest
## Terrible Low Mid High Perfect
## Terrible 15 20 4 2 17
## Low 0 33 36 23 44
## Mid 0 8 42 66 52
## High 0 1 18 152 17
## Perfect 3 0 14 32 108
sum(diag(t)) / nrow(dataTest) # Accuracy
## [1] 0.4950495
So Random Forest increases our out-of-sample accuracy to ~49.5%. For this model, we can print out a Variable Importance plot:
varImpPlot(modelForest)
We could also try to print out a sample tree, but these will be very deep and not very interpretable.
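If we did want to inspect one, the randomForest package provides getTree(), which returns a single tree’s split rules as a matrix. A minimal sketch, assuming modelForest from above:

```r
# Each row is a node: left/right child, split variable, split point,
# whether the node is terminal, and the predicted class at terminals.
head(getTree(modelForest, k = 1, labelVar = TRUE))
```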
Finally, let’s try a boosted tree model. We will use the package XGBoost, which is one of the most popular methods among winners of Kaggle competitions. Here is some more documentation for this package: https://xgboost.readthedocs.io/en/latest/R-package/xgboostPresentation.html
There are many parameters which are important to tune for this model. These include:

- eta: learning rate. This is the gradient step size in the algorithm. Small values (eta <= 0.01) take a long time to converge, while large values converge quickly but may settle on a worse solution.
- nrounds: number of trees in the boosting algorithm. This should be tuned jointly with eta.
- max.depth: maximum depth of the trees. Can be tuned for better performance, but typically very short trees or stumps perform well.
library(xgboost)
## Warning: package 'xgboost' was built under R version 3.4.4
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
X_train <- model.matrix(Y ~ ., data = dataTrain)[,-1]
X_test <- model.matrix(Y ~ ., data = dataTest)[,-1]
# For multiclass problems with K classes, XGBoost requires the labels to
# go from 0 to K-1. Since factor variables go from 1,...K, we subtract 1 here.
dtrain <- xgb.DMatrix(data = X_train, label = as.integer(dataTrain$Y) - 1)
dtest <- xgb.DMatrix(data = X_test, label = as.integer(dataTest$Y) - 1)
watchlist <- list(train = dtrain, test = dtest)
params <- list(
objective = "multi:softprob",
eval_metric = "mlogloss",
num_class = 5, # For multiclass problems, we write here number of classes K
eta = 0.1,
max_depth = 2)
nthread <- 1 # Number of threads to use on the computer
nrounds <- 100
set.seed(123)
boostedTree <- xgb.train(params = params, data = dtrain,
nrounds = nrounds, nthread = nthread,
watchlist = watchlist)
## [1] train-mlogloss:1.566804 test-mlogloss:1.574540
## [2] train-mlogloss:1.529574 test-mlogloss:1.543979
## [3] train-mlogloss:1.497253 test-mlogloss:1.518770
## [4] train-mlogloss:1.468007 test-mlogloss:1.494595
## [5] train-mlogloss:1.441120 test-mlogloss:1.473930
## [6] train-mlogloss:1.417165 test-mlogloss:1.456853
## [7] train-mlogloss:1.395467 test-mlogloss:1.440404
## [8] train-mlogloss:1.374466 test-mlogloss:1.424539
## [9] train-mlogloss:1.355929 test-mlogloss:1.413812
## [10] train-mlogloss:1.339434 test-mlogloss:1.401887
## [11] train-mlogloss:1.323851 test-mlogloss:1.390460
## [12] train-mlogloss:1.309225 test-mlogloss:1.381427
## [13] train-mlogloss:1.295569 test-mlogloss:1.372312
## [14] train-mlogloss:1.282944 test-mlogloss:1.363736
## [15] train-mlogloss:1.270612 test-mlogloss:1.355640
## [16] train-mlogloss:1.258374 test-mlogloss:1.348399
## [17] train-mlogloss:1.247609 test-mlogloss:1.341230
## [18] train-mlogloss:1.237245 test-mlogloss:1.335660
## [19] train-mlogloss:1.227055 test-mlogloss:1.328611
## [20] train-mlogloss:1.217489 test-mlogloss:1.322743
## [21] train-mlogloss:1.208410 test-mlogloss:1.316952
## [22] train-mlogloss:1.199686 test-mlogloss:1.312633
## [23] train-mlogloss:1.191423 test-mlogloss:1.308705
## [24] train-mlogloss:1.183369 test-mlogloss:1.304309
## [25] train-mlogloss:1.175521 test-mlogloss:1.299025
## [26] train-mlogloss:1.167961 test-mlogloss:1.294592
## [27] train-mlogloss:1.160444 test-mlogloss:1.289609
## [28] train-mlogloss:1.153538 test-mlogloss:1.285131
## [29] train-mlogloss:1.146878 test-mlogloss:1.281997
## [30] train-mlogloss:1.140206 test-mlogloss:1.277410
## [31] train-mlogloss:1.134315 test-mlogloss:1.274387
## [32] train-mlogloss:1.127629 test-mlogloss:1.270981
## [33] train-mlogloss:1.121957 test-mlogloss:1.267450
## [34] train-mlogloss:1.116110 test-mlogloss:1.264185
## [35] train-mlogloss:1.110761 test-mlogloss:1.261163
## [36] train-mlogloss:1.105527 test-mlogloss:1.258574
## [37] train-mlogloss:1.099923 test-mlogloss:1.255841
## [38] train-mlogloss:1.094668 test-mlogloss:1.252924
## [39] train-mlogloss:1.089838 test-mlogloss:1.249797
## [40] train-mlogloss:1.084797 test-mlogloss:1.247731
## [41] train-mlogloss:1.080003 test-mlogloss:1.244753
## [42] train-mlogloss:1.075240 test-mlogloss:1.243580
## [43] train-mlogloss:1.070701 test-mlogloss:1.241751
## [44] train-mlogloss:1.066336 test-mlogloss:1.239819
## [45] train-mlogloss:1.062062 test-mlogloss:1.237804
## [46] train-mlogloss:1.057435 test-mlogloss:1.235940
## [47] train-mlogloss:1.052973 test-mlogloss:1.234079
## [48] train-mlogloss:1.049040 test-mlogloss:1.232525
## [49] train-mlogloss:1.044791 test-mlogloss:1.230383
## [50] train-mlogloss:1.040891 test-mlogloss:1.228033
## [51] train-mlogloss:1.036912 test-mlogloss:1.226314
## [52] train-mlogloss:1.032748 test-mlogloss:1.225106
## [53] train-mlogloss:1.029062 test-mlogloss:1.224244
## [54] train-mlogloss:1.025053 test-mlogloss:1.223026
## [55] train-mlogloss:1.021513 test-mlogloss:1.220692
## [56] train-mlogloss:1.017879 test-mlogloss:1.219475
## [57] train-mlogloss:1.014563 test-mlogloss:1.218697
## [58] train-mlogloss:1.010892 test-mlogloss:1.217827
## [59] train-mlogloss:1.007726 test-mlogloss:1.216723
## [60] train-mlogloss:1.004298 test-mlogloss:1.215281
## [61] train-mlogloss:1.000969 test-mlogloss:1.214648
## [62] train-mlogloss:0.997644 test-mlogloss:1.214504
## [63] train-mlogloss:0.994448 test-mlogloss:1.213299
## [64] train-mlogloss:0.991105 test-mlogloss:1.211877
## [65] train-mlogloss:0.987848 test-mlogloss:1.210477
## [66] train-mlogloss:0.984571 test-mlogloss:1.209312
## [67] train-mlogloss:0.981382 test-mlogloss:1.208515
## [68] train-mlogloss:0.978425 test-mlogloss:1.207316
## [69] train-mlogloss:0.975413 test-mlogloss:1.206419
## [70] train-mlogloss:0.972293 test-mlogloss:1.204307
## [71] train-mlogloss:0.969384 test-mlogloss:1.204427
## [72] train-mlogloss:0.966243 test-mlogloss:1.203319
## [73] train-mlogloss:0.963548 test-mlogloss:1.202075
## [74] train-mlogloss:0.960367 test-mlogloss:1.200667
## [75] train-mlogloss:0.957406 test-mlogloss:1.198975
## [76] train-mlogloss:0.954592 test-mlogloss:1.197262
## [77] train-mlogloss:0.951789 test-mlogloss:1.196298
## [78] train-mlogloss:0.948900 test-mlogloss:1.195658
## [79] train-mlogloss:0.946345 test-mlogloss:1.195372
## [80] train-mlogloss:0.943247 test-mlogloss:1.193742
## [81] train-mlogloss:0.940740 test-mlogloss:1.193286
## [82] train-mlogloss:0.938058 test-mlogloss:1.192671
## [83] train-mlogloss:0.935579 test-mlogloss:1.191937
## [84] train-mlogloss:0.932535 test-mlogloss:1.190958
## [85] train-mlogloss:0.930082 test-mlogloss:1.189354
## [86] train-mlogloss:0.927524 test-mlogloss:1.188316
## [87] train-mlogloss:0.925077 test-mlogloss:1.188252
## [88] train-mlogloss:0.922644 test-mlogloss:1.187224
## [89] train-mlogloss:0.920334 test-mlogloss:1.187318
## [90] train-mlogloss:0.917842 test-mlogloss:1.186612
## [91] train-mlogloss:0.915554 test-mlogloss:1.185548
## [92] train-mlogloss:0.913153 test-mlogloss:1.185273
## [93] train-mlogloss:0.910597 test-mlogloss:1.184518
## [94] train-mlogloss:0.908087 test-mlogloss:1.183620
## [95] train-mlogloss:0.905358 test-mlogloss:1.182822
## [96] train-mlogloss:0.903516 test-mlogloss:1.182201
## [97] train-mlogloss:0.900974 test-mlogloss:1.182170
## [98] train-mlogloss:0.898526 test-mlogloss:1.181829
## [99] train-mlogloss:0.896312 test-mlogloss:1.180648
## [100] train-mlogloss:0.894040 test-mlogloss:1.179544
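Rather than fixing nrounds = 100 up front, we could also let cross-validation choose it. A sketch using xgb.cv with early stopping, assuming params and dtrain from above; cvResults is an illustrative name:

```r
# 5-fold CV; stop once test mlogloss fails to improve for 10 rounds.
set.seed(123)
cvResults <- xgb.cv(params = params, data = dtrain, nrounds = 500,
                    nfold = 5, early_stopping_rounds = 10, verbose = FALSE)
cvResults$best_iteration  # CV-chosen number of boosting rounds
```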
Find the probability predictions for each class:
predTrainLongVec <- predict(boostedTree, X_train)
predTestLongVec <- predict(boostedTree, X_test)
predTrainMatrix <- matrix(predTrainLongVec, ncol = 5, byrow = T)
predTestMatrix <- matrix(predTestLongVec, ncol = 5, byrow = T)
predTrain <- apply(predTrainMatrix, 1, which.max)
predTest <- apply(predTestMatrix, 1, which.max)
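Note that predTrain and predTest are now class indices 1 through 5, which is why the confusion matrices below have numeric column headers. If named categories are preferred, one option (a sketch, assuming ratingLevels from earlier; predTestLabeled is an illustrative name) is:

```r
# Map the 1..5 indices back to the ordered category names.
predTestLabeled <- factor(ratingLevels[predTest],
                          levels = ratingLevels, ordered = TRUE)
```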
Find the accuracy:
print("Training Set")
## [1] "Training Set"
t <- table(dataTrain$Y, predTrain)
t # Confusion Matrix
## predTrain
## 1 2 3 4 5
## Terrible 111 19 18 2 24
## Low 10 213 54 26 106
## Mid 4 14 266 121 97
## High 1 6 40 481 38
## Perfect 4 5 12 48 402
sum(diag(t)) / nrow(dataTrain) # Accuracy
## [1] 0.6941565
print("Testing Set")
## [1] "Testing Set"
t <- table(dataTest$Y, predTest)
t # Confusion Matrix
## predTest
## 1 2 3 4 5
## Terrible 25 13 3 0 17
## Low 7 30 40 19 40
## Mid 1 17 43 64 43
## High 0 7 28 136 17
## Perfect 3 3 11 34 106
sum(diag(t)) / nrow(dataTest) # Accuracy
## [1] 0.4809052
It looks like XGBoost reaches about 48.1% out-of-sample accuracy: better than CART, though in this case slightly below Random Forest.
We can plot the first few trees in our XGBoost model:
# install.packages("DiagrammeR")
library(DiagrammeR)
## Warning: package 'DiagrammeR' was built under R version 3.4.3
# First Tree to predict "Terrible"
xgb.plot.tree(feature_names = names(dataTrain)[2:1137], boostedTree,
trees = 0)
# First Tree to predict "Low"
xgb.plot.tree(feature_names = names(dataTrain)[2:1137], boostedTree,
trees = 1)
# First Tree to predict "Mid"
xgb.plot.tree(feature_names = names(dataTrain)[2:1137], boostedTree,
trees = 2)
# First Tree to predict "High"
xgb.plot.tree(feature_names = names(dataTrain)[2:1137], boostedTree,
trees = 3)
# First Tree to predict "Perfect"
xgb.plot.tree(feature_names = names(dataTrain)[2:1137], boostedTree,
trees = 4)
# Second Tree to predict "Terrible"
xgb.plot.tree(feature_names = names(dataTrain)[2:1137], boostedTree,
trees = 5)
Each individual tree is reasonably interpretable. However, understanding the full model requires much more analysis, because there are 500 trees in total (100 rounds times 5 classes).
We can also compute the variable importance scores for this XGBoost model.
xgb.importance(feature_names = names(dataTrain)[2:1137], boostedTree)
## Feature Gain Cover Frequency
## 1: good 9.889497e-02 2.139099e-02 0.0166666667
## 2: stay 7.106747e-02 2.203154e-02 0.0140000000
## 3: beauti 5.451984e-02 2.911681e-02 0.0260000000
## 4: great 5.097638e-02 3.822557e-02 0.0280000000
## 5: dirti 4.674607e-02 4.215097e-02 0.0426666667
## ---
## 289: tidi 7.863741e-05 1.491958e-05 0.0006666667
## 290: inde 7.638627e-05 8.951719e-06 0.0006666667
## 291: pick 4.654224e-05 2.970406e-05 0.0006666667
## 292: grand 3.354063e-05 2.538920e-05 0.0006666667
## 293: green 6.029264e-06 2.072554e-05 0.0006666667
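As with Random Forest, we can also visualize these importance scores. A minimal sketch, assuming the same feature names as above; importanceMatrix is an illustrative name:

```r
# Bar chart of the top 20 words by Gain.
importanceMatrix <- xgb.importance(feature_names = names(dataTrain)[2:1137],
                                   boostedTree)
xgb.plot.importance(importanceMatrix[1:20, ])
```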